refactor: Compact schema_generation to <150 LOC #855
Closed
MQ37 wants to merge 1 commit into
Conversation
Same behavior as the previous commit (set-union merge, type-array unions, format detection, NYC regression coverage). Tightening:
- Inline `inferType` into `infer`.
- Drop `JsonSchemaArray` / `JsonSchemaObject` types (redundant with `JsonSchemaProperty`).
- Drop `SchemaGenerationOptions` type (inlined into signature).
- Tests use `it.each` tables for the 6 type-union cases and 5 format-detection cases; 28 → 21 tests, coverage equivalent.

src/utils/schema_generation.ts: 122 → 69 LOC
tests/unit/schema_generation.test.ts: 265 → 79 LOC
Combined: 387 → 148 LOC
jirispilka
requested changes
May 20, 2026
jirispilka
left a comment
Collaborator
As you might have guessed already, I prefer the human readable version :D
Contributor
Author
@jirispilka ahh, these humans - closing then 😄
MQ37
added a commit
that referenced
this pull request
May 21, 2026
## Context
`get-dataset-schema`, `call-actor`, and `get-actor-run` all share
`generateSchemaFromItems`, which delegated to `to-json-schema@0.2.5`.
Any two items with different key sets collapsed to
`{type:'array',items:{type:'object'}}` — properties wiped out. Reported
on a real NYC restaurants dataset where ~half the items carried
`markdown` and half didn't.
## Solution
Replaced the library with an in-house inferrer in
`src/utils/schema_generation.ts`. The merge does a set-union of property
keys and recurses; primitive type conflicts emit JSON Schema `type`
arrays (e.g. `["string","null"]`). Drops the `arrayMode` field from
`get-dataset-schema` — it only existed as a workaround for the buggy
`mode:'all'`, and all internal callers were already passing it anyway.
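
As an illustration of the set-union merge described above, here is a minimal sketch. It is not the code from `src/utils/schema_generation.ts`: the `Schema` type and the `unionTypes` / `asArray` helpers are hypothetical, and the bodies of `infer` and `merge` are assumptions that only mirror the behavior stated in this description (union of property keys, recursion, `type` arrays on primitive conflicts).

```typescript
// Hypothetical, simplified schema shape; the real JsonSchemaProperty may differ.
type Schema = {
    type: string | string[];
    properties?: Record<string, Schema>;
    items?: Schema;
};

const asArray = (t: string | string[]): string[] => (Array.isArray(t) ? t : [t]);

// Union two `type` values, keeping first-seen order and collapsing to a string when only one remains.
function unionTypes(a: string | string[], b: string | string[]): string | string[] {
    const all = Array.from(new Set([...asArray(a), ...asArray(b)]));
    return all.length === 1 ? all[0] : all;
}

// Merge two schemas: union the types, union the property keys, recurse on shared keys and on `items`.
function merge(a: Schema, b: Schema): Schema {
    const out: Schema = { type: unionTypes(a.type, b.type) };
    const ap = a.properties;
    const bp = b.properties;
    if (ap || bp) {
        const props: Record<string, Schema> = {};
        const keys = Array.from(new Set([...Object.keys(ap ?? {}), ...Object.keys(bp ?? {})]));
        for (const k of keys) {
            const left = ap?.[k];
            const right = bp?.[k];
            if (left && right) props[k] = merge(left, right);
            else if (left) props[k] = left;
            else if (right) props[k] = right;
        }
        out.properties = props;
    }
    if (a.items && b.items) out.items = merge(a.items, b.items);
    else if (a.items || b.items) out.items = a.items ?? b.items;
    return out;
}

// Infer a schema for one JSON value (assumes JSON-compatible input only).
function infer(value: unknown): Schema {
    if (value === null) return { type: 'null' };
    if (Array.isArray(value)) {
        // Fold every element schema into a single `items` schema.
        const items = value
            .map(infer)
            .reduce<Schema | undefined>((acc, s) => (acc ? merge(acc, s) : s), undefined);
        return items ? { type: 'array', items } : { type: 'array' };
    }
    if (typeof value === 'object') {
        const properties: Record<string, Schema> = {};
        for (const [k, v] of Object.entries(value as Record<string, unknown>)) properties[k] = infer(v);
        return { type: 'object', properties };
    }
    if (typeof value === 'number') return { type: Number.isInteger(value) ? 'integer' : 'number' };
    return { type: typeof value }; // 'string' or 'boolean' for JSON input
}
```

Under these assumptions, `merge(infer({ x: 1 }), infer({ x: 'hi' }))` yields `{ type: 'object', properties: { x: { type: ['integer', 'string'] } } }`, and two items with different key sets keep the union of their properties instead of collapsing to a bare `{ type: 'object' }`.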
## Worth your attention
- **No external dependency, no supply-chain surface.** `to-json-schema`
was last published in 2020 and the upstream repo is dead. Owning ~120
LOC of pure JSON-Schema inference is cheaper than auditing an
unmaintained transitive surface on a server that handles customer Apify
tokens.
- **Type-array unions for primitive conflicts.** `{x:1}` + `{x:"hi"}`
produces `{"type":["integer","string"]}` — spec-valid JSON Schema,
handled natively by LLMs reading the tool output. Verified the generated
schema is never Ajv-validated downstream (checked both this repo and
`apify-mcp-server-internal` — Ajv only validates tool *input* args).
- **`arrayMode` field removed from `get-dataset-schema`.** Technically a
public API change. Safe because (a) all 3 internal callers always passed
`arrayMode:'all'`, and (b) the `'first'` mode was never useful —
`to-json-schema` applies it recursively to nested arrays too, which is
almost never what callers want.
- **Drops the upstream's `format:"style"` false positive.** Free-form
Markdown text was being tagged with a CSS-ish format. The new format
detector covers only `uri`, `date-time`, `date`, `email`, `uuid` — the
unambiguous ones.
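
A conservative format detector restricted to those five formats could look roughly like the sketch below. The `detectFormat` name, the `FORMATS` table, and every regular expression are assumptions made for illustration; the patterns actually used in this PR are not shown here and may be stricter or looser.

```typescript
// Hypothetical pattern table; the order only matters if patterns overlap.
const FORMATS: [string, RegExp][] = [
    ['date-time', /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/],
    ['date', /^\d{4}-\d{2}-\d{2}$/],
    ['uuid', /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i],
    ['email', /^[^\s@]+@[^\s@]+\.[^\s@]+$/],
    ['uri', /^https?:\/\/\S+$/i],
];

// Return the first matching format name, or undefined for free-form text.
function detectFormat(value: string): string | undefined {
    for (const [name, pattern] of FORMATS) {
        if (pattern.test(value)) return name;
    }
    return undefined;
}
```

Because every pattern is anchored at both ends, free-form Markdown such as `"# Menu\n\nSome **bold** text"` matches nothing and gets no `format` keyword at all, which is the behavior the false-positive guard is meant to protect.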
## Follow-up
- **#855 stacks a compact rewrite on top of this PR — same behavior, 387
→ 148 combined code LOC.** Merge order: this PR first, then #855. Or
squash-merge #855 alone to replace this. This PR ships the verbose,
easy-to-read version for review clarity.
---------
Co-authored-by: Jiří Spilka <jiri.spilka@apify.com>
jirispilka
added a commit
that referenced
this pull request
May 26, 2026
## Context
Reviewer ceiling for `src/utils/schema_generation.ts` + tests is 150 LOC of code (comments / blank lines free). The fix in #854 lands at 387.
## Solution
Compact both files to 148 combined LOC with the same behavior. Stacked on top of #854 so the diff in this PR is purely the compression; there is no behavior change to review here.
- `src/utils/schema_generation.ts`
- `tests/unit/schema_generation.test.ts`
## Worth your attention
- `it.each` tables. Same regression coverage: NYC sushi case, heterogeneous keys, format false-positive guard.
- `!` non-null assertions added in `merge`. Both are inside branches guarded by `&&` / `||` predicates immediately above (`ap[k] && bp[k]` and `a.items || b.items`). The narrowing alternative was 4 LOC longer.
- `JsonSchemaArray`, `JsonSchemaObject`, `SchemaGenerationOptions`, and the `removeEmptyArrays` export were all unused outside the module. `JsonSchemaProperty` stays (imported by `actor_execution.ts`).
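
To make the trade-off behind the `!` assertions concrete, here is a small sketch of the pattern. It is not the PR's `merge` code: the `Schema` type and both function names are hypothetical, and only the `a.items || b.items` guard is taken from the description above.

```typescript
// Hypothetical, simplified schema shape for illustration only.
type Schema = { type: string; items?: Schema };

// Compact form: the `||` guard on the line above proves one side is defined,
// so the non-null assertion cannot fail at runtime.
function pickItemsCompact(a: Schema, b: Schema): Schema | undefined {
    if (a.items || b.items) {
        const picked: Schema = (a.items ?? b.items)!;
        return picked;
    }
    return undefined;
}

// Narrowed form: no assertion; each branch lets the compiler narrow the type itself.
function pickItemsNarrowed(a: Schema, b: Schema): Schema | undefined {
    if (a.items) {
        return a.items;
    }
    if (b.items) {
        return b.items;
    }
    return undefined;
}
```

Both functions return the same result for every input. The compact form trades a compiler-checked guarantee for fewer lines, which stays safe only while the guard and the assertion remain next to each other.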